SAVERCAT is a method for dimension reduction and denoising of single-cell gene expression data that can flexibly adjust for arbitrary observed sample- and cell-level covariates. With SAVERCAT, you can:
Obtain a low-dimensional representation of the cells that is corrected for the effects of batch and other confounding covariates.
Remove the effects of batch and other confounding covariates in the original high-dimensional gene expression matrix.
Further denoise the data, that is, remove technical variation that is due to the inherent random sampling introduced during the library preparation and sequencing steps of the experiment.
This tutorial shows how to use SAVERCAT within a Seurat v3 workflow. For more details of this method and examples of its applications, see the paper (link …).
You can install SAVERCAT from github.
install.packages("devtools")
devtools::install_github("Janezjz/SAVERCAT")
SAVERCAT depends on Python versions of Tensorflow 2.0 and Keras for neural network model training. To install these, you can either call functions from R as follows:
tensorflow::install_tensorflow()
keras::install_keras()
Or you can install separately via Python.
library(SAVERCAT)
library(Seurat)
library(cowplot)
library(dplyr)
For this tutorial, we use a Peripheral Blood Mononuclear Cells (PBMC) dataset that is released together with our package. The dataset consists of PBMCs from three healthy donors sequenced using three different 10x technologies – VDJ, V2, and V3. The original raw data for each sequencing run can be obtained from 10X Genomics:
5’ prime: https://support.10xgenomics.com/single-cell-vdj/datasets/3.0.0/vdj_v1_hs_pbmc2_5gex_protein
V3: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
V2: https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc8k
In each dataset, we identified three main cell types – monocytes, T/NK cells, and B cells. Unidentified cells were removed after filtering out genes with non-zero expression in less than 2 cells. We randomly sampled 10,000 genes and 2,000 cells to include in the demo dataset pbmc_prep, which is included within SAVERCAT and used for this tutorial. An analysis of the original data set can be found in our paper (give link).
Load in the data.
data("pbmc_prep")
The raw UMI count matrix is stored in pbmc_prep$counts. The rows correspond to genes and columns correspond to cells. There are 10000 genes and 2000 cells in pbmc_small.
expr_mat = pbmc_prep$counts
str(expr_mat)
## num [1:10000, 1:2000] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:10000] "NECAB2" "HTR3E-AS1" "AC021054.1" "AL161781.2" ...
## ..$ : chr [1:2000] "V3_9782" "VDJ_7666" "VDJ_4624" "V2_7939" ...
pbmc_prep$metadata consists of two cell-level metadata: tech (technology) and celltype. The cell type label will be used in our analyses only to assess the accuracy of covariate adjustment and denoising. Following the analysis in our paper, we will treat each technology as a batch and our goal will be to align the gene expression values across batches without blurring the distinctions between the cell types.
metadata = pbmc_prep$metadata
head(metadata)
| tech | celltype | |
|---|---|---|
| V3_9782 | V3 | T/NK cells |
| VDJ_7666 | VDJ | T/NK cells |
| VDJ_4624 | VDJ | B cells |
| V2_7939 | V2 | Monocytes |
| V3_2765 | V3 | Monocytes |
| V3_5953 | V3 | T/NK cells |
First, we use the raw counts and metadata extracted from pbmc_prep to create a Seurat object.
SeuratObject = CreateSeuratObject(counts = expr_mat, meta.data = metadata)
SeuratObject
## An object of class Seurat
## 10000 features across 2000 samples within 1 assay
## Active assay: RNA (10000 features, 0 variable features)
By performing the Seurat clustering workflow, we can see that cells separate according to both batch and cell type. So batches are confonded with celltypes.
all.genes = rownames(x = SeuratObject)
SeuratObject = NormalizeData(SeuratObject, assay = "RNA", verbose = FALSE)
SeuratObject = FindVariableFeatures(SeuratObject, assay = "RNA", selection.method = "vst", nfeatures = 2000, verbose = FALSE)
SeuratObject = ScaleData(SeuratObject, assay = "RNA", features = all.genes, verbose = FALSE)
SeuratObject = RunPCA(SeuratObject, assay = "RNA", reduction.name = "pca_rna", verbose = FALSE)
SeuratObject = RunTSNE(SeuratObject, assay = "RNA", reduction = "pca_rna", dims = 1:20, reduction.name = "tsne_rna")
p11 = Seurat::DimPlot(SeuratObject, reduction = "tsne_rna", group.by = "tech")
p21 = Seurat::DimPlot(SeuratObject, reduction = "tsne_rna", group.by = "celltype")
plot_grid(p11,p21)
The first step of the SAVERCAT workflow is to run the SaverPreprocess function. This function does several things:
It constructs a set of highly variable genes comprised of the highly variable genes found separately within each level of the covariate (in this example, batch).
It also constructs a design matrix for the covariates that we wish to corrected for. The covariates are assumed to be stored in meta.data. Names of categorical and continuous variables are passed to this function through the arguments categorical.var and continuous.var, respectively. By default, categorical.var=NULL and continous.var=NULL, in which case SAVERCAT corrects for only the scaled log cell library size.
It prepares inputs and outputs for neural network model training.
The main inputs of SaverPreprocess are:
SeuratObject: A Seurat object.
assay.name: Name of assay where count data is stored. Default is ‘RNA’.
slot.name: Name of slot where raw count data is stored. Default is ‘counts’.
categorical.var: Names of categorical variables to be adjusted by SAVERCAT. Default is NULL.
continous.var: Names of continuous variables to be adjusted by SAVERCAT. Default is NULL.
This function outputs a Seurat object with a list of information used to train SAVERCAT stored in @tools$preprocess. For all intents and purposes, you don’t need to worry about what’s stored in there. But here we will give you a glimpse:
B: Observed covariates matrix.
hvg: Names of highly variable genes.
sf: Cell specific size factors.
x: A submatrix of the expression count matrix. The rows correspond to cells and colomns correspond to highly variable genes or genes with a mean expression of more than perc.exp. This will be taken as the true outputs for the decoder to learn.
x.norm: Log normailized and standardized form of x. x.norm concatenated with B will be the input to the encoder.
For pbmc_prep dataset, we set categorical.var = c("tech"), continous.var = NULL. Raw counts are stored in “RNA” assay, and thus we call SaverPreprocess as follows.
SeuratObject = SaverPreprocess(SeuratObject, assay.name = "RNA", slot.name = "counts", categorical.var = c("tech"), continous.var = NULL)
## Training on 3000 highly variable genes.
For curiousity’s sake let’s look at what’s stored in @tools$preprocess$B. The first column is scaled log library size. The other three columns are scaled indicator variable for each batch.
head(as.data.frame(SeuratObject@tools$preprocess$B))
| libsize | VDJ | V2 | V3 | |
|---|---|---|---|---|
| V3_9782 | 0.1694465 | -0.613412 | -0.645153 | 1.1451998 |
| VDJ_7666 | 1.1953229 | 1.629411 | -0.645153 | -0.8727734 |
| VDJ_4624 | -0.3527182 | 1.629411 | -0.645153 | -0.8727734 |
| V2_7939 | -0.3136720 | -0.613412 | 1.549245 | -0.8727734 |
| V3_2765 | 2.0623935 | -0.613412 | -0.645153 | 1.1451998 |
| V3_5953 | 0.6845039 | -0.613412 | -0.645153 | 1.1451998 |
Take a look at the highly variable genes, stored in @tools$preprocess$hvg:
head(SeuratObject@tools$preprocess$hvg)
## [1] "FCGR3A" "IGLC2" "HLA-DRA" "GZMB" "FCER1A" "PF4"
First, we demonstrate how to use SAVERCAT to learn a low-dimensional representation of origianl gene expression matrix while correcting for effects of batch and other observed confounding variables. The function that does this job is LearnLatentRepresentation.
The main inputs of this function are:
SeuratObject: A Seurat object.
assay.name: Name of assay where count data is stored. Default is ‘RNA’.
slot.name: Name of slot where raw count data is stored. Default is ‘counts’.
save.model: Folder specified to save the model.
verbose: Verbosity mode. 0 is silent, 1 is progress bar, 2 is one line per epoch. Default is 2.
LearnLatentRepresentation returns the original Seurat object with the reduced dimension embedding stored in @reductions$saverc, a DimReduc slot. The trained model will be stored in the specified folder. Here we go:
SeuratObject = LearnLatentRepresentation(SeuratObject, assay.name = "RNA", slot.name = "counts", save.model = "data/cvae_model", verbose=0)
## KL annealing for 52 epochs.
On a computer with Intel(R) Core(TM) i7-4770HQ processor @ 2.20GHz, 16GB of memory and 4 cores, this step takes about 4-5 minutes.
SeuratObject@reductions$saverc
## A dimensional reduction object with key saverc_
## Number of dimensions: 24
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: RNA
Take a look at the learned embedding:
head(Embeddings(SeuratObject, reduction = "saverc")[, 1:5])
## saverc_1 saverc_2 saverc_3 saverc_4 saverc_5
## V3_9782 0.1386935 0.3832377 -0.06396260 0.10498183 0.3440343
## VDJ_7666 1.1370139 -0.8533904 -0.06350582 -0.07492319 1.3053000
## VDJ_4624 -0.9202040 0.3245428 -0.06255666 -0.01295023 1.3481779
## V2_7939 -0.8776860 -0.5917239 -0.06330741 -0.07481068 -0.8258450
## V3_2765 -0.7832297 -0.8559683 0.22213639 -0.07610734 -0.8338816
## V3_5953 -0.1395055 0.2717909 -0.06435148 0.10438860 2.5158727
This low-dimensional representation can then be used for visualization and downstream analysis within any Seurat workflow. You just need to specify that this is the dimensional reduction slot to use, by setting reduction = "saverc". Here we first plot UMAP and then cluster cells by a nearest neighbor graph computed on this embedding.
SeuratObject = Seurat::RunUMAP(SeuratObject, reduction = "saverc", reduction.name = "umap_saverc", dims = 1:dim(SeuratObject@reductions$saverc)[2])
p1 = Seurat::DimPlot(SeuratObject, reduction = "umap_saverc", group.by = "tech")
p2 = Seurat::DimPlot(SeuratObject, reduction = "umap_saverc", group.by = "celltype")
plot_grid(p1,p2)
SeuratObject = Seurat::FindNeighbors(SeuratObject, dims = 1:10, reduction = "saverc")
## Computing nearest neighbor graph
## Computing SNN
SeuratObject = Seurat::FindClusters(SeuratObject, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2000
## Number of edges: 57152
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8423
## Number of communities: 7
## Elapsed time: 0 seconds
head(Idents(SeuratObject))
## V3_9782 VDJ_7666 VDJ_4624 V2_7939 V3_2765 V3_5953
## 1 0 2 3 4 1
## Levels: 0 1 2 3 4 5 6
Next, to correct for covariates and denoise the original gene expression matrix, we need to train a decoder that maps the latent representation back to the original gene expression space. This is a time-consuming step, partly because we employ cross-validation to guard against over-fitting (a.k.a. over-smoothing of the data, which may introduce artifacts into the data.) The function for this step is DecodeLatentRepresentation, which does not yet return a corrected gene expression matrix, but instead saves a memory-efficient “decoder” model. This model can then be used for fast and flexible adjustment of all (or a subset) of the covariates and for denoising.
The main inputs of DecodeLatentRepresentation are:
SeuratObject: A Seurat object with SAVERCAT low-dimensional representation stored in @reductions$saverc.
assay.name: Name of assay where count data is stored. Default is ‘RNA’.
slot.name: Name of slot where raw count data is stored. Default is ‘counts’.
save.model: Folder specified to save the model.
verbose: Verbosity mode. 0 is silent, 1 is progress bar, 2 is one line per epoch. Default is 2.
It returns a Seurat object with training details stored in SeuratObject@tools$train. The trained decoder model is stored in the folder specified in save.model.
SeuratObject = DecodeLatentRepresentation(SeuratObject, assay.name = "RNA", slot.name = "counts", save.model = "data/cvae_model", verbose=0)
## Calculating validation loss...
## Calculating validation loss...
## Calculating validation loss...
## Calculating validation loss...
On a computer with Intel(R) Core(TM) i7-4770HQ processor @ 2.20GHz, 16GB of memory and 4 cores, this step takes about 5-7 minutes.
Now we can use the low-dimensional latent representation produced by LearnLatentRepresentation function and decoder model trained by DecodeLatentRepresentation function to perform covariate correction and denoising of the gene expression matrix. The function for doing this is SaverDenoise.
Inputs to SaverDenoise function:
SeuratObject: A Seurat object with SAVERCAT low-dimensional representation stored in @reductions$saverc.
assay.name: Name of assay where count data is stored. Default is ‘RNA’.
slot.name: Name of slot where raw count data is stored. Default is ‘counts’.
save.model: Folder where well trained decoder model is stored.
subset.genes: Names of a subset of genes for which the covariate-correction and denoising will be performed. Default is NULL, in which case the procedure will be applied to all genes.
SaverDenoise returns a Seurat object with three new assays:
saverdenoised: the denoised estimates based on the posterior mean, which is a weighted average of the corrected counts (savercovadj) and the autoencoder covariate-corrected prediction. These are the values that you should use for visualization (e.g. heatmaps), clustering, and trajectory inference.
saversamp: For each gene in each cell, a sample from its Bayesian posterior distribution under the SAVER model. Thus, the values in this matrix reflect the uncertainty of the SAVER denoised estimates. These are the values to use if you are computing gene dispersion values, gene-gene correlations, or, in general, quantifying the uncertainty or variance of genes across cells.
savercovadj: covariate-corrected counts using cdf matching centered on the autoencoder predictions. If you don’t want to do any denoising or smoothing of the data, and just want to correct for the covariates, these are the values to use. Also, differential expression analyses using savercovadj values will allow for correct type I error control.
In each of the above newly created assays, the denoised/covariate adjusted/sampled counts are stored in the counts slot.
To demonstrate, for the pbmc_prep dataset, we will predict for all genes. Since denoised (and covariate adjusted) data are no longer sparse, and thus may be memory consuming if you do it for all genes at once, you may be interested in only running this step on a subset of genes. You can specify the subset of genes for which you want denoised- and covariate-corrected values by passing their index to the parameter subset.genes. For example, you can set subset.genes = c("IL1B", "CD36", "CD6", "CD7", "HLA-DRA") then you’ll only perform covariate-correction and denoising for these 5 genes.
This step took 30-50 seconds to run on a computer with 16GB memory and 4 cores.
SeuratObject = SaverDenoise(SeuratObject, assay.name = "RNA", slot.name = "counts", save.model = "data/cvae_model/decoder", subset.genes = NULL)
## Calculating SAVER output
## 0% 10% 20% 30% 40% ...50% 60% 70% 80% 90% 100% ...
SeuratObject
## An object of class Seurat
## 40000 features across 2000 samples within 4 assays
## Active assay: RNA (10000 features, 2000 variable features)
## 3 other assays present: savercovadj, saverdenoised, saversamp
## 4 dimensional reductions calculated: pca_rna, tsne_rna, saverc, umap_saverc
Take a look at the denoised/sampled/covariate adjusted counts.
# denoised SAVER estimates
GetAssayData(object = SeuratObject, assay = "saverdenoised", slot = "counts")[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
## V3_9782 VDJ_7666 VDJ_4624 V2_7939 V3_2765 V3_5953
## NECAB2 0.00007 0.00001 0.00001 0.00104 0.00045 0.00001
## HTR3E-AS1 . . . . . .
## AC021054.1 0.02838 0.02823 0.02833 0.02825 0.02823 0.02828
## AL161781.2 0.00082 0.00039 0.16253 0.00185 0.01687 0.00151
## ARMC10 0.19646 0.19569 0.19718 0.19626 0.19860 0.19569
## SLC26A6 0.03054 0.03049 0.03051 0.03054 0.03049 0.03050
# sampled SAVER estimates.
GetAssayData(object = SeuratObject, assay = "saversamp", slot = "counts")[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
## V3_9782 VDJ_7666 VDJ_4624 V2_7939 V3_2765 V3_5953
## NECAB2 0.00006 0.00001 0.00001 0.00134 0.00050 0.00002
## HTR3E-AS1 . . . . . .
## AC021054.1 0.02369 0.03441 0.01991 0.02584 0.04216 0.01082
## AL161781.2 0.00871 . 0.01117 0.00009 0.01828 0.00061
## ARMC10 0.15391 0.28133 0.19816 0.13495 0.11811 0.22002
## SLC26A6 0.01804 0.01397 0.04209 0.01824 0.03889 0.02756
# SAVER covariate corrected counts
GetAssayData(object = SeuratObject, assay = "savercovadj", slot = "counts")[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
## V3_9782 VDJ_7666 VDJ_4624 V2_7939 V3_2765 V3_5953
## NECAB2 . 1e-05 0.00001 0.00062 . .
## HTR3E-AS1 . . . . . .
## AC021054.1 0.01292 . 0.00817 0.00126 . 0.00439
## AL161781.2 0.00013 . 0.03640 0.00106 . 0.00006
## ARMC10 0.03144 . 0.06110 0.02333 0.11894 .
## SLC26A6 0.00842 . 0.00365 0.00807 . 0.00207
Next we show how to use these denoised/covariate adjusted/sampled counts in differential expression testing and visualization.
To identify marker genes in each cell type, you can use either the saversamp or the savercovadj assay. Here we perform differential expression using saversamp.
DefaultAssay(SeuratObject) = "saversamp"
all.genes = rownames(x = SeuratObject)
SeuratObject = NormalizeData(SeuratObject, verbose = FALSE)
SeuratObject = FindVariableFeatures(SeuratObject, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
SeuratObject = ScaleData(SeuratObject, features = all.genes, verbose = FALSE)
Idents(object = SeuratObject) = 'celltype'
markers = FindAllMarkers(SeuratObject, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster Monocytes
## Calculating cluster T/NK cells
## Calculating cluster B cells
Take a look at the top 2 marker genes for each cell type.
top2 = markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
top2
| p_val | avg_logFC | pct.1 | pct.2 | p_val_adj | cluster | gene |
|---|---|---|---|---|---|---|
| 0 | 2.860308 | 0.994 | 0.930 | 0 | Monocytes | FCN1 |
| 0 | 4.205225 | 0.999 | 0.998 | 0 | Monocytes | S100A8 |
| 0 | 2.074046 | 0.966 | 0.889 | 0 | T/NK cells | CD3D |
| 0 | 2.320547 | 0.977 | 0.831 | 0 | T/NK cells | IL7R |
| 0 | 3.411753 | 1.000 | 0.950 | 0 | B cells | IGHM |
| 0 | 3.237347 | 0.993 | 0.768 | 0 | B cells | IGLC2 |
As mentioned above, the SAVER sampled values align the distributions by batch while maintaining variation across cell types. To compare of gene expression distributions across cell types, we recommend you to use saversamp assay.
VlnPlot(SeuratObject, features = top2$gene, assay = "saversamp", log = TRUE, pt.size = 0.8)
Here we also draw violin plots using the SAVER denoised estimates for comparison purpose. As you can see, there is less variation between cells within each cell type using the SAVER denoised estimates.
SeuratObject = NormalizeData(SeuratObject, assay="saverdenoised", verbose = FALSE)
SeuratObject = FindVariableFeatures(SeuratObject, assay="saverdenoised", selection.method = "vst", nfeatures = 2000, verbose = FALSE)
SeuratObject = ScaleData(SeuratObject, assay="saverdenoised", features = all.genes, verbose = FALSE)
VlnPlot(SeuratObject, features = top2$gene, assay = "saverdenoised", log = TRUE, pt.size = 0.8)
To visualize gene expression and perform trajectory analyses, you should use SAVER estimate in saverdenoised assay. This is the posterior mean of the gene expression level. Intuitively, this estimate emulates the data that you would have gotten, if there were no batch effects and if deeper sequencing were performed on your sample.
We draw a heatmap of the denoised gene expression for top 20 marker genes.
DefaultAssay(SeuratObject) = "saverdenoised"
top20 = markers %>% group_by(cluster) %>% top_n(n = 20, wt = avg_logFC)
DoHeatmap(SeuratObject, features = unique(top20$gene), assay = "saverdenoised", angle = 0)
For comparison, we also plot the heatmap of the raw gene expression for the top 20 marker genes. Compared to the former plot, there is more noise in the following plot.
DoHeatmap(SeuratObject, features = unique(top20$gene), assay = "RNA", angle = 0)
Denoised estimates should be used for visualizing genes on the dim-reduction plot.
For pbmc_prep dataset, we first perform UMAP on the latent dimension reduction (saverc) and store the embedding in umap_saverc. Then we color cells by the marker genes from the denoised matrix with FeaturePlot function.
SeuratObject = Seurat::RunUMAP(SeuratObject, reduction = "saverc", reduction.name = "umap_saverc", dims = 1:dim(SeuratObject@reductions$saverc)[2])
DefaultAssay(SeuratObject) = "saverdenoised"
FeaturePlot(SeuratObject, features = top2$gene, reduction = "umap_saverc")